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Abstract 



This article presents novel results concerning the recovery of signals from undersampled data 

in the common situation where such signals arc not sparse in an orthonormal basis or incoherent 

dictionary, but in a truly redundant dictionary. This work thus bridges a gap in the literature 

and shows not only that compressed sensing is viable in this context, but also that accurate 

recovery is possible via an €i-analysis optimization problem. We introduce a condition on the 

r^ measurement/sensing matrix, which is a natural generalization of the now well-known restricted 

t^ isometry property, and which guarantees accurate recovery of signals that are nearly sparse in 

(possibly) highly overcomplete and coherent dictionaries. This condition imposes no incoherence 

restriction on the dictionary and our results may be the first of this kind. We discuss practical 

examples and the implications of our results on those applications, and complement our study 

C^ by demonstrating the potential of £i-analysis for such problems. 

> 



1 Introduction 



Compressed sensing is a new data acquisition theory based on the discovery that one can exploit 
^^ sparsity or compressibility when acquiring signals of general interest, and that one can design 

•O nonadaptive sampling techniques that condense the information in a compressible signal into a 

. . small amount of data [131 [TBI I18j . In a nutshell, reliable, nonadaptive data acquisition, with far 

. !_( fewer measurements than traditionally assumed, is possible. By now, applications of compressed 

^ sensing are abundant and range from imaging and error correction to radar and remote sensing, 

^ see [21 U] and references therein. 

In a nutshell, compressed sensing proposes acquiring a signal x G M"' by collecting m linear 
measurements of the form y^ = (a^, x) + z^, 1 < A; < m, or in matrix notation, 

y = Ax + z; (1.1) 

A is an m X n sensing matrix with m typically smaller than n by one or several orders of magnitude 
(indicating some significant undersampling) and z is an error term modeling measurement errors. 
Sensing is nonadaptive in that A does not depend on x. Then the theory asserts that if the unknown 
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signal X is reasonably sparse, or approximately sparse, it is possible to recover x, under suitable 
conditions on the matrix A, by convex programming: we simply find the solution to 

min ||x||i subject to ll^i — ylh^ej (-^i) 

where || • II2 denotes the standard Euclidean norm, ||2;||i = ^ |a;j| is the £i-norm and e^ is a 
likely upper bound on the noise power ||2:||2- (There are other algorithmic approaches to com- 
pressed sensing based on greedy algorithms such as Orthogonal Matching Pursuit [HI Il2] , Iterative 
Thresholding [71[23], Compressive Sampling Matching Pursuit [33], and many others.) 

Quantitatively, a concrete example of a typical result in compressed sensing compares the quality 



of the reconstruction from the data y and the model (1.1) with that available if one had an oracle 
giving us perfect knowledge about the most significant entries of the unknown signal x. Define 
- here and throughout - by Xg the vector consisting of the s largest coefficients of 2; G M" in 
magnitude: 

Xs = argmin ||x — x||2, (1-2) 

l|x-j|o<s 

where ||x||o = \{i : xi 7^ 0}|. In words, Xs is the best s-sparse approximation to the vector x, where 
we shall say that a vector is s-sparse if it has at most s nonzero entries. Put differently, x — x^ is 
the tail of the signal, consisting of the smallest n — s entries of x. In particular, if x is s-sparse, 
X — Xs = 0. Then with this in mind, one of the authors [9] improved on the work of Candes, 
Romberg and Tao |14j and established that (Li) recovers a signal x obeying 

II ^ II ^ /^ W'^ ^s 1 , ,^ /I o\ 

\\x-x\\2<Co T= \-Cie, (1.3) 

provided that the 2s -restricted isometry constant of A obeys ^2^ < V^ — 1. The constants in this 
result have been further improved, and it is now known to hold when 62s < 0.4652 [24j) see also 
|25j . In short, the recovery error from (Li) is proportional to the measurement error and the tail 
of the signal. This means that for compressible signals, those whose coefficients obey a power law 
decay, the approximation error is very small, and for exactly sparse signals it completely vanishes. 
The definition of restricted isometries first appeared in [15] where it was shown to yield the 



error bound (1.3) in the noiseless setting, i. e. when e = and z = 0. 



Definition 1.1 For an m x n measurement matrix A, the s-restricted isometry constant 5s of A 
is the smallest quantity such that 

(1 - ds)\\xg < \\Axg < {1 + 6s)\\xg 

holds for all s-sparse signals x. 



With this, the condition underlying (1.3) is fairly natural since it is interpreted as preventing 
sparse signals from being in the nullspace of the sensing matrix A. Further, a matrix having a 
small restricted isometry constant essentially means that every subset of s or fewer columns is 
approximately an orthonormal system. It is now well known that many types of random measure- 
ment matrices have small restricted isometry constants [161132^ [37 1 16]. For example, matrices with 
Gaussian or Bernoulli entries have small restricted isometry constants with very high probability 
whenever the number of measurements m is on the order of slog{n/s). The fast multiply matrix 
consisting of randomly chosen rows of the discrete Fourier matrix also has small restricted isometry 
constants with very high probability with m on the order of s(logn)^. 



1.1 Motivation 

The techniques above hold for signals which are sparse in the standard coordinate basis or sparse 
with respect to some other orthonormal basis. However, there are numerous practical examples in 
which a signal of interest is not sparse in an orthonormal basis. More often than not, sparsity is 
expressed not in terms of an orthonormal basis but in terms of an overcomplete dictionary. This 
means that our signal / E M" is now expressed as / = Dx where D € M"^ is some overcomplete 
dictionary in which there are possibly many more columns than rows. The use of overcomplete 
dictionaries is now widespread in signal processing and data analysis, and we give two reasons why 
this is so. The first is that there may not be any sparsifying orthonormal basis, as when the signal 
is expressed using curvelets |1H [TO] or time- frequency atoms as in the Gabor representation [22] . 
In these cases and others, no good orthobases are known to exist and researchers work with tight 
frames. The second reason is that the research community has come to appreciate and rely on the 
flexibility and convenience offered by overcomplete representations. In linear inverse problems such 
as deconvolution and tomography for example - and even in straight signal-denoising problems 
where A is the identity matrix ~ people have found overcomplete representations to be extremely 
helpful in reducing artifacts and mean squared error (MSE) [39', '40] . It is only natural to expect 
overcomplete representations to be equally helpful in compressed sensing problems which, after all, 
are special inverse problems. 

Although there are countless applications for which the signal of interest is represented by some 
overcomplete dictionary, the compressed sensing literature is lacking on the subject. Consider the 
simple case in which the sensing matrix A has Gaussian (standard normal) entries. Then the 
matrix AD relating the observed data with the assumed (nearly) sparse coefficient sequence x has 
independent rows but each row is sampled from J\f{0, S), where S = D*D. If D is an orthonormal 
basis, then these entries are just independent standard normal variables, but if -D is not unitary then 
the entries are correlated, and AD may no longer satisfy the requirements imposed by traditional 
compressed sensing assumptions. In [35] recovery results are obtained when the sensing matrix A 
is of the form ^D* where $ satisfies the restricted isometry property. In this case the sampling 
matrix must depend on the dictionary D in which the signal is sparse. We look for a universal 
result which allows the sensing matrix to be independent from the signal and its representation. To 
be sure, we are not aware of any such results in the literature guaranteeing good recovery properties 
when the columns may be highly - and even perfectly - correlated. 

Before continuing, it might be best to fix ideas to give some examples of applications in which 
redundant dictionaries are of crucial importance. 

Oversampled DFT The Discrete Fourier Transform (DFT) matrix is an n x n orthogonal matrix 
whose fcth column is given by 



dk{t) = —j=(^ 
Jn 



'2-Kikt/n 

with the convention that 0<t, fc<n — 1. Signals which are sparse with respect to the DFT 
are only those which are superpositions of sinusoids with frequencies appearing in the lattice 
of those in the DFT. In practice, we of course rarely encounter such signals. To account for 
this, one can consider the oversampled DFT in which the sampled frequencies are taken over 
even smaller equally spaced intervals, or at small intervals of varying lengths. This leads to 
an overcomplete frame whose columns may be highly correlated. 

Gabor frames Recall that for a fixed function g and positive time-frequency shift parameters a 



and b, the kth column (where k is the double index k = (fci, ^2)) of the Gabor frame is given 
by 

Gfc(t)=5(i-Me'"^'^'*- (1-4) 

Radar and sonar along with other imaging systems appear in many engineering applications, 
and the goal is to recover pulse trains given by 

j=l J 

Due to the time- frequency structure of these applications, Gabor frames are widely used [50] . 
If one wishes to recover pulse trains from compressive samples by using a Gabor dictionary, 
standard results do not apply. 

Curvelet frames Curvelets provide a multiscale decomposition of images, and have geometric 
features that set them apart from wavelets and the likes. Conceptually, the curvelet trans- 
form is a multiscale pyramid with many directions and positions at each length scale, and 
needle-shaped elements at fine scales [llj. The transform gets its name from the fact that it 
approximates well the curved singularities in an image. This transform has many properties 
of an orthonormal basis, but is overcomplete. Written in matrix form D, it is a tight frame 
obeying the Parseval relations 

f = Y.^f,dk)dk and ||/i = J^ K/,4)|', 
k k 

where we let {d^} denote the columns of D. Although columns of D far apart from one another 
are very uncorrelated, columns close to one another have high correlation. Thus none of the 
results in compressed sensing apply for signals represented in the curvelet domain. 

Wavelet Frames The undecimated wavelet transform (UWT) is a wavelet transform achieving 
translation invariance, a property that is missing in the discrete wavelet transform (DWT) |20j . 
The UWT lacks the downsamplers and upsamplers in the DWT but upsamples the filter co- 
efficients by a factor of 2*" at the {m — l)st level - hence it is overcomplete. Also, the Unitary 
Extension Principle of Ron and Shen [36j facilitates tight wavelet frame constructions for 
L^(]R'^) which may have many more wavelets than in the orthonormal case. This redundancy 
has been found to be helpful in image processing (see e.g. |39]), and so one wishes for a 
recovery result allowing for significant redundancy and/or correlation. 

Concatenations In many applications a signal may not be sparse in a single orthonormal basis, 
but instead is sparse over several orthonormal bases. For example, a linear combination 
of spikes and sines will be sparse when using a concatenation of the coordinate and Fourier 
bases. One also benefits by exploiting geometry and pointwise singularities in images by using 
combinations of tight frame coefficients such as curvelets, wavelets, and brushlets. However, 
due to the correlation between the columns of these concatenated bases, current compressed 
sensing technology does not apply. 

These and other applications strongly motivate the need for results applicable when the dictio- 
nary is redundant and has correlations. This state of affair, however, exposes a large gap in the 
literature since current compressed sensing theory only applies when the dictionary is an orthonor- 
mal basis, or when the dictionary is extremely uncorrelated (see e.g. \26 \ 138 1 15]). 



1.2 Do we really need incoherence? 

Current assumptions in the field of compressed sensing and sparse signal recovery impose that 
the measurement matrix have uncorrelated columns. To be formal, one defines the coherence of a 
matrix M as 

m„Mk)\ 



KM) 



max ■ 
j<k 



MihWM. 



where Mj and Mk denote columns of M. We say that a dictionary is incoherent if fi is small. 
Standard results then require that the measurement matrix satisfy a strict incoherence property 
|41lll2j . as even the RIP imposes this. If the dictionary D is highly coherent, then the matrix AD 
will also be coherent in general. 

Coherence is in some sense a natural property in the compressed sensing framework, for if two 
columns are closely correlated, it will be impossible in general to distinguish whether the energy in 
the signal comes from one or the otherj^ For example, imagine that we are not undersampling and 
that A is the identity so that we observe y = Dx. Suppose the first two columns are identical, di = 
d2- Then the measurement di can be explained by the input vectors (1, 0, . . . , 0) or (0, 1, 0, . . . , 0) 
or any convex combination. Thus there is no hope of reconstructing a unique sparse signal x from 
measurements y = ADx. However, we are not interested in recovering the coefficient vector x, but 
rather the actual signal Dx. The large correlation between columns in D now does not impose a 
problem because although it makes it impossible to tell apart coefficient vectors, this is not the 
goal. This simple example suggests that perhaps coherence is not necessary. If D is coherent, then 
we clearly cannot recover x as in our example, but we may certainly be able to recover the signal 
/ = Dx from measurements y = Af as we shall see next. 
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Figure 1: The compressed sensing process and its domains. This distinguishes the domains 
in which the measurements, signals, and coefficients reside 



1.3 Gaussian sensing matrices 

To introduce our results, it might be best for pedagogical purposes to discuss a concrete situation 
first, and we here assume that the sensing matrix has iid Gaussian entries. In practice, signals are 
never exactly sparse, and dictionaries are typically designed to make D* f for some classes of / as 
sparse as possible. Therefore, in this paper, we propose a reconstruction from y = Af + z hy the 
method of ^i-analysis: 

/ = argmin||L>*/||i subject to \\Af -y\\2<e, (Pi) 



^ Recall that when the dictionary D is sufficiently incoherent, standard compressed sensing guarantees that we 
recover x and thus / = Dx, provided x is s-sparse with s sufficiently small. 



where again e is a likely upper bound on the noise level ||2;||2- Empirical studies have shown very 
promising results for the £i-analysis problem. Its geometry has been studied [21] as well as its 
applications to image restoration [8]. However, there are no results in the literature about its 
performance. 

Our main result is that the solution to (Pi) is very accurate provided that D*f has rapidly 
decreasing coefficients. Our result for the Gaussian case is below while the general theorem appears 
in Section [l.5[ 



Theorem 1.2 Let D be an arbitrary nx d tight frame and let A be a mx n Gaussian matrix with 
m on the order of slog{d/s). Then the solution f to (Pi) obeys 

||/-/|b<Co. + Ci "^*^-y^^^-"\ 



for some numerical constants Cq and Ci, and where {D*f)s is the vector consisting of the largest 



s entries of D* f in magnitude as in (1.2). 



We have assumed that D is a tight frame although this is simply to make the analysis easier and 
is of course not necessary. Having said this, our results proves not only that compressed sensing 
is viable with highly coherent dictionaries, but also that the £i-analysis problem provably works 
in this setting. We are not aware of any other result of this kind. To be sure, other methods for 
redundant dictionaries such as [6(i\ E] force incoherence on the dictionary D so that the matrix 
AD conforms to standard compressed sensing results. The method in [35j requires that the sensing 
matrix depend on the dictionary. These are drastically different from the setting here, where we 
impose no such properties on the dictionary. We point out that our result holds even when the 
coherence of the dictionary D is maximal, meaning two columns are completely correlated. Finally, 
we also note that the dependence on the noise level is optimal and that the tail bound in the error 



is analogous to previous bounds in the non-redundant case such as (1.3). 



1.4 Implications 



As we mentioned, the dependence on the noise in the error given by Theorem 1.2 is optimal, and 
so we need only discuss how the second term affects the estimation error. This term will of course 
be negligible when the norm of the tail, D* f — {D* f)s, is small. Hence, the result says that for 
any dictionary, signals / such that D* f decays rapidly can be approximately reconstructed using 
£i-analysis from just a few random measurements. This is exactly the case for many dictionaries 
used in practice and many classes of signals as discussed earlier. As a side remark, one can also 
guarantee rapid decay of D* f (we assume the signal expansion / = Dx) when D*D is well behaved 
and the coefficient vector x is nearly sparse. To see why this is true, suppose D is a tight frame so 
that D* f = D*Dx. A norm commonly used to quantify sparsity is the quasi p-norm with p < 1 
defined via ||x||p = ^^ \xif (sparser signals with unit 2-norm have smaller p-norms). Now a simple 
calculation shows that 



\D*f\\p< maxV|(D*Z)),,-|P 



i/p 



In words, if the columns of the Gram matrix are reasonably sparse and if / happens to have 
a sparse expansion, then the frame coefficient sequence D* f is also sparse. All the transforms 



discussed above, namely, the Gabor, curvelet, wavelet frame, oversampled Fourier transform all 
have nearly diagonal Gram matrices - and thus, sparse columns. 

We now turn to the implications of our result to the applications we have already mentioned, 
and instantiate the theorem in the noiseless case due to the optimality of the noise level in the 
error. 

Multitone signals To recover multitone signals, we use an oversampled DFT, which is not or- 
thonormal and may have very large coherence. However, since each "off-grid" tone has a 
rapidly decaying expansion, D* f will have rapidly decaying coefficientsjj Thus our result 
implies that the recovery error is negligible when the number of measurements is about the 
number of tones times a log factor. 

Radar For radar and sonar applications using Gabor dictionaries, our result similarly implies a neg- 
ligible error. Indeed, with notation as in (1.4), one sees that the sequence {(tt;(t)e*'^*, Gkit))}k 



decays quickly (each pulse has a rapidly decaying expansion). Therefore, our result implies 
a negligible error when the number of measurements is roughly the number of pulses in the 
pulse train, up to a log factor. 

Images Roughly speaking, the curvelet coefficient sequence of an arbitrary image, which is dis- 
continuous along piecewise-C^ edges but is otherwise smooth, decays like k~^''^ - up to a 
log factor - when arranged in a decreasing order of magnitude. Hence, our theorem asserts 
that one can get an £2 error of about s~^ from about slogn random samples of /. This is 
interesting since this is the approximation error one would get by bandlimiting the image 
to a spatial frequency about equal to s or, equivalently, by observing the first s^ Fourier 
coefficients of /. So even though we do not know where the edges are, this says that one 
can sense an image nonadaptively m times, and get a recovery error which is as good as that 
one would traditionally get by taking a much larger number - about m? - of samples. This 
is a considerable gain. Of course, similar implications hold when the undecimated wavelet 
transform of those images under study decay rapidly. 

Concatenations When working with signals which are sparse over several orthonormal bases, it 
is natural to use a dictionary D consisting of a concatenation of these bases. For example, 
consider the dictionary D which is the concatenation of the identity and the Fourier basis 
(ignoring normalizations for now). Then D*D is made up of four blocks, two of which are the 
identity and two of which are the DFT, and does not have sparse columns. Then even when 
/ is sparse in D, the coefficients of D* f may be spread. If this is the case, then the theorem 
does not provide a good error bound. This should not be a surprise however, for if D*f is 
not close to a sparse signal, then we do not expect / to be the minimizer of the £i-norm in 
(Pi). In this case, ^i-analysis is simply not the right method to use. 

To summarize, we see that in the majority of the applications, our theorem yields good recovery 
results. As seen in the last example, the ^i-analysis method only makes sense when D* f has quickly 
decaying coefficients, which may not be the case for concatenations of orthonormal bases. However, 
this is not always the case, as we see in the following. 



■^In practice, one smoothly localizes the data to a time interval by means of a nice window w to eliminate effects 
having to do with a lack of periodicity. One can then think of the trigonometric exponentials as smoothly vanishing 
at both ends of the time interval under study. 



An easy example. As above, let D be the n x 2n dictionary consisting of a concatenation of 
the identity and the DFT, normahzed to ensure D is a tight frame (below, F is the DFT normalized 
to be an isometry): 

We wish to create a sparse signal that uses linearly dependent columns for which there is no local 
isometry. Assume that n is a perfect square and consider the Dirac comb 

f{t) = Y^6{t-jv^), 

which is a superposition of spikes spread ^/n apart. Thus our signal is a sparse linear combination 
of spikes and sines, something that by the last example alone we would not expect to be able to 
recover. However, D*f = [f /]/\/2 is exactly sparse implying that ||-D*/ — (-D*/)s||i = when 
s > 2-^/n. Thus our result shows that £i-analysis can exactly recover the Dirac comb consisting of 
spikes and sines from just a few general linear functionals. 

1.5 Axiomization 

We now turn to the generalization of the above result, and give broader conditions about the sensing 
matrix under which the recovery algorithm performs well. We will impose a natural property on 
the measurement matrix, analogous to the restricted isometry property. 

Definition 1.3 (D-RIP) Let Sg be the union of all subspaces spanned by all subsets of s columns 
of D. We say that the measurement matrix A obeys the restricted isometry property adapted to D 
(abbreviated D-RIP) with constant 6s if 

(l - 6s)\\v\\l < \\Av\\l < {I + 6s)\\v\\l 

holds for all v £ Tig. 

We point out that S^ is just the image under D of all s-sparse vectors. Thus the D-RIP is a 
natural extension to the standard RIP. We will see easily that Gaussian matrices and other random 
compressed sensing matrices satisfy the D-RIP. In fact any mxn matrix A obeying for fixed v G M", 

P((l - 6)\\v\\l < \\Av\\l < (1 + 6)\\vg) < Ce-^- (1.5) 

(7 is an arbitrary positive numerical constant) will satisfy the D-RIP with overwhelming probability, 
provided that m > slog{d/s). This can be seen by a standard covering argument (see e.g. the proof 



of Lemma 2.1 in [3S]). Many types of random matrices satisfy (1.5). It is now well known that 



matrices with Gaussian, subgaussian, or Bernoulli entries satisfy (1.5) with number of measurements 



m on the order of slog{d/s) (see e.g. |n|)- It has also been shown [32j that if the rows of A are 



independent (scaled) copies of an isotropic V2 vector, then A also satisfies (1.5). Recall that an 
isotropic ip2 vector a is one that satisfies for all v, 

E\{a,v)\'^ = \\vf and mf{t ■.Eexp{{a,vf /t"^) <2} < a\\v\\2, 



for some constant a. See [32] for further details. Finally, it is clear that if A is any of the above 

random matrices then for any fixed unitary matrix U, the matrix AU will also satisfy the condition. 

The D-RIP can also be analyzed via the Johnson-Lindenstrauss lemma (see e.g. |27|[3]). There 

are many results that show certain types of matrices satisfy this lemma, and these would then 



satisfy the D-RIP via (1.5). Subsequent to our submission of this manuscript, Ward and Krahmer 
showed that randomizing the column signs of any matrix that satisfies the standard RIP yields a 
matrix which satisfies the Johnson-Lindenstrauss lemma |29j. Therefore, nearly all random matrix 
constructions which satisfy standard RIP compressed sensing requirements will also satisfy the 
D-RIP. A particularly important consequence is that because the randomly subsampled Fourier 
matrix is known to satisfy the RIP, this matrix along with a random sign matrix will thus satisfy 



D-RIP. This gives a fast transform which satisfies the D-RIP. See Section 4.2 for more discussion. 
We are now prepared to state our main result. 

Theorem 1.4 Let D be an arbitrary tight frame and let A be a measurement matrix satisfying 
D-RIP with 62s < 0.08. Then the solution f to (Pi) satisfies 



where the constants Cq and Ci may only depend on 5; 



2s- 



Remarks. We actually prove that the theorem holds under the weaker condition 8ts < 0.6, 
however we have not tried to optimize the dependence on the values of the restricted isometry 
constants; refinements analagous to those in the compressed sensing literature are likely to improve 
the condition. Further, we note that since Gaussian matrices with m on the order of slog(d/,s) 



obey the D-RIP, Theorem 1.2 is a special case of Theorem 1.4 



1.6 Organization 



The rest of the paper is organized as follows. In Section [2] we prove our main result. Theorem 1.4 



Section |3] contains numerical studies highlighting the impact of our main result on some of the 
applications previously mentioned. In Section [4] we discuss further the implications of our result 
along with its advantages and challenges. We compare it to other methods proposed in the literature 
and suggest an additional method to overcome some impediments. 

2 Proof of Main Result 



We now begin the proof of Theorem 1.4, which is inspired by that in [14J. The new challenge here 
is that although we can still take advantage of sparsity, the vector possessing the sparse property 
is not being multiplied by something that satisfies the RIP, as in the standard compressed sensing 
case. Rather than bounding the tail of / — / by its largest coefficients as in p3], we bound a portion 
of D*h in an analagous way. We then utilize the D-RIP and the fact that D is a tight frame to 
bound the error, ||/ — /II2. 

Let / and / be as in the theorem, and let Tq denote the set of the largest s coefficients of D* f 
in magnitude. We will denote by Dt the matrix D restricted to the columns indexed by T, and 
write Dy to mean (Dt)* ■ With h = f — f, our goal is to bound the norm of h. We will do this in 
a sequence of short lemmas. The first is a simple consequence of the fact that / is the minimizer. 

9 



Lemma 2.1 (Cone Constraint) The vector D*h obeys the following cone constraint, 

\\D*Tch\U<2\\D*Tcf\U + \\D*T,h\\i. 

Proof. Since both / and / are feasible but / is tlie minimizer, we must have ||D*/||i < ||D*/||i. 
We then have that 

\\D*Tjh + \\D*Tgfh = \\D*fh>\\D*f\U 

= \\D*f-D*h\\i 

> \\D*Tjh - WDnHi - \\D*Tsf\\i + WDtsHi- 



This imphes the desired cone constraint. ■ 

We next divide the coordinates Tq into sets of size M (to be chosen later) in order of decreasing 
magnitude of D^ch. Call these sets Ti,T2, . . ., and for simplicity of notation set Tqi = TqU Ti. We 
then bound the tail of D*h. 

Lemma 2.2 (Bounding the tail) Setting p = s/M and rj = 2\\D^cf\\i/y/s, we have the follow- 
ing bound, 

J2\\D*T,hh<VPi\\D*Tohh + v). 
i>2 

Proof. By construction of the sets Tj, we have that each coefficient of D^, h, written \D^, h\(i^\, 
is at most the average of those on Tj : 

I^T,+,^IW < \\Dt,Hi/M. 
Squaring these terms and summing yields 

\\Dh+M\l < \\D*T^h\\l/M. 



This along with the cone constraint in Lemma 2.1 gives 



Y. \\D*TMh < E \\D*T,h\\i/VM = \\D*Tch\\,/VM. 
i>2 j>i 



With p = s/M and rj = 2\\D^cf\\l/^/s, it follows from Lemma 2.1 and the Cauchy-Schwarz 
inequality that 

Y,\\D*T,h\\2<VPi\\D*Toh\\2 + 7j), 
i>2 

as desired. ■ 

Next we observe that by the feasibility of /, Ah must be small. 

Lemma 2.3 (Tube Constraint) The vector Ah satisfies the following, 

\\Ah\\2 < 2e. 



10 



Proof. Since / is feasible, we have 

\\Ah\\2 = \\Af - Afh < \\Af - yh + \\Af - yh <e + e = 2e. 

We will now need the following result which utilizes the fact that D satisfies the D-RIP. 
Lemma 2.4 (Consequence of D-RIP) The following inequality holds, 



Proof. Since D is a tight frame, DD* is the identity, and this along with the D-RIP and Lemma 2.2 
then imply the following: 

2e> \\Ah\\2 = \\ADD*h\\2 

> \\ADT,,D*T^,h\\2 - Y, \\ADt^D*t^\\2 

i>2 

> V^-^s+m\\Dt,,D*t,M\2 - a/p(1 + <5a/)(Pto^II2 + V)- 

Since we also have 111)^/1112 < ll^lb, this yields the desired result. ■ 

We now translate these bounds to the bound of the actual error, ||/i||2- 

Lemma 2.5 (Bounding the error) The error vector h has norm that satisfies, 

\\hg < \\h\\2\\DTo,D*j,^M\2 + p{\\D*T,h\\2 + r?)2. 

Proof. Since D* is an isometry, we have 

\\h\\l = \\D*hg = \\D*j,^^^hg + \\D*Tc^h\\l 



= {h,DT,,D*T^,,h) + \\D*T.h\\l 

<\\h\\2\\DT,,D*j^,,h\\2 + \\D*T.^h\\l 

<\\h\\2\\DT,,D*T^,h\h + Pi\\D*Tohh + V?, 



where the last inequality follows from Lemma 2.2 

We next observe an elementary fact that will be useful. The proof is omitted. 

Lemma 2.6 For any values u, v and c > 0, we have 



uv < 1 . 

- 2 2c 
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We may now conclude the proof of Theorem 1.4 First we employ Lemma 2.6 twice to the 



inequality given by Lemma 2.5 (with constants ci, C2 to be chosen later) and the bound \\D^ ^Ib < 
1 2 to get 



Hli<^ 



Cl 



I ^ \\DTo,D*T,M\2 , 



2ci 
n* 

2ci 



|^pTcnI?^o,/^|li 



< 



Cl 



+ 



2ci 



+ p||/i||2 + 2pr?||/i||2+w2 
+ p\\h\\l + 2p 



\ , II^Toi^Toi'^lli , „||j,||2 , o„/''^2||/l||2 



'+l^)+^^'- 



Simplifying, this yields 

(- 



Cl 

2 



P- PC2 



<^}Dr,.D*T.nHi+[f^+Py- 



Using the fact that ^/v? + v^ < n + w for u, u > 0, we can further simply to get our desired 
lower bound, 



|I?ToiI??^o,/^||2>||/^||2V2ci(l-(|+P + pC2))-W2ci(^+p). 



(2.1) 



Combining (2.1) with Lemma 2.4 implies 

2e>Kx\\h\\2-K2ri, 



where 



K, 



2ci(l - (5s+m) f 1 - (y + P + PC2)J - \J p{\ + (^m), and 



i^2 = V2ci(1-<5,+m)(/0/c2+p) - \//9(l+(^Af). 

It only remains to choose the parameters ci, C2, and M so that i^i is positive. We choose ci=l, 

M = 6s, and take C2 arbitrarily small so that K\ is positive when b-js < 0.6. Tighter restrictions 

on b'js will of course force the constants in the error bound to be smaller. For example, if we set 

Cl = 1/2, C2 = 1/10, and choose M = 6s, we have that whenever b-js < 1/2 that (Pi) reconstructs 

/ satisfying 

ll-Dr-/lli 
||/-/||2<62g + 30 y . 

Note that if b^s is even a little smaller, say b^s ^ 1/4, the constants in the theorem are just 
Cl = 10.3 and C2 = 7.33. Note further that by Corollary 3.4 of [3l], bis < 0.6 is satisfied whenever 
(52s ^ 0.08. This completes the proof. ■ 
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3 Numerical Results 

We now present some numerical experiments illustrating the effectiveness of recovery via ^i-analysis 
and also compare the method to other alternatives. Our results confirm that in practice, ^i-analysis 
reconstructs signals represented in truly redundant dictionaries, and that this recovery is robust 
with respect to noise. 

In these experiments, we test the performance on a simulated real-world signal from the field of 
radar detection. The test input is a superposition of six radar pulses. Each pulse has a duration of 
about 200 ns, and each pulse envelope is trapezoidal, with a 20 ns rise and fall time, see Figure [2j 
For each pulse, the carrier frequency is chosen uniformly at random from the range 50 MHz to 2.5 
GHz. The Nyquist interval for such signals is thus 0.2 ns. Lastly, the arrival times are distributed 
at random in a time interval ranging from t = s to t ~ 1.64 /xs; that is, the time interval under 
study contains n = 8192 Nyquist intervals. We acquire this signal by taking 400 measurements 
only, so that the sensing matrix ^ is a Gaussian matrix with 400 rows. The dictionary D is a Gabor 
dictionary with Gaussian windows, oversampled by a factor of about 60 so that d ~ 60 x 8, 192 = 
491, 520. The main comment about this setup is that the signal of interest is not exactly sparse in 
D since each pulse envelope is not Gaussian (the columns of D are pulses with Gaussian shapes) 
and since both the frequencies and arrival times are sampled from a continuous grid (and thus do 
not match those in the dictionary). 



Pulse envelopes; thin black line is spectrum ol full signal. N = 8192, 5 GSPS Nyquist 



0.2 0.4 0.6 0.8 




Power Spectral Density; thin biacl< iine is spectrum of tuil signal 




Figure 2: Input signal in the time and frequency domains. The signal of interest is a super- 
position of 6 radar pulses, each of which being about 200 ns long, and with frequency carriers 
distributed between 50 MHz and 2.5 GHz (top plot). As can be seen, three of these pulses 
overlap in the time domain. 

Figure p^ shows the recovery (without noise) by £i -analysis in both the time and frequency 
domains. In the time domain we see (in red) that the difference between the actual signal and the 
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recovered signal is small, as we do in the frequencey domain as well. These pulses together with 
the carrier frequencies are well recovered from a very small set of measurements. 




ill 
miP 




0.2 0.4 0.6 0.8 



1.2 1.4 1.6 



Figure 3: Recovery in both the time (below) and frequency (above) domains by £i-analysis. 
Blue denotes the recovered signal, green the actual signal, and red the difference between the 
two. 



In practice, reweighting the li norm often offers superior results. We use the reweighted li- 
analysis method, which solves several sequential weighted £i-minimization problems, each using 
weights computed from the solution of the previous problem [T7| . This procedure has been observed 
to be very effective in reducing the number of measurements needed for recovery, and outperforms 
standard £i-minimization in many situations (see e.g. |17) . [28], [33] )• Figured shows reconstruction 
results after just one reweighting iteration; the root-mean squared error (RMSE) is significantly 
reduced, by a factor between 3 and 4. 

Because D is massively overcomplete, the Gram matrix D*D is not diagonal. Figure [5] depicts 
part of the Gram matrix D*D for this dictionary, and shows that this matrix is "thick" off of 
the diagonal. We can observe visually that the dictionary D is not an orthogonal system or even 
a matrix with low coherence, and that columns of this dictionary are indeed highly correlated. 
Having said this, the second plot in Figure [5] shows the rapid decay of the sequence D* f where / 
is the signal in Figure [2| 

Our next simulation studies the robustness of £i-analysis with respect to noise in the measure- 
ments y = Af + z, where z is a white noise sequence with standard deviation a. Figure [6] shows 
the recovery error as a function of the noise level. As expected, the relationship is linear, and this 



simulation shows that the constants in Theorem 1.4 seem to be quite small. This plot also shows 
the recovery error with respect to noise using a reweighted £i-analysis; reweighting also improves 
performance of £i-analysis, as is seen in Figure [6j 
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Figure 4: Recovery in both the time (below) and frequency (above) domains by £i -analysis 
after one reweighted iteration. Blue denotes the recovered signal, green the actual signal, and 
red the difference between the two. The RMSE is less than a third of that in Figure l4| 



An alternative to £i-analysis is ^i-synthesis, which we discuss in Section [4. 1[ ^i-synthesis min- 
imizes in the coefficient domain, so its solution is a vector x, and we set / = Dx. Our next 
simulation confirms that although we cannot recover the coefficient vector x, we can still recover 
the signal of interest. Figure [7] shows the largest 200 coefficients of the coefficient vector x, and 
those of D* f as well as D* f for both £i-analysis and £i-synthesis. The plot also shows that the 
recovery of £i-analysis with reweighting outperforms both standard £i-analysis and ^i-synthesis. 

Our final simulation compares recovery error on a compressible signal (in the time domain) for 
the ^i-analysis, reweighted £i-analysis, and £i-synthesis methods. We see in Figure [8] that the li- 
analysis and ^i-synthesis methods both provide very good results, and that reweighted ^i-analysis 
provides even better recovery error. 



4 Discussion 



Theorem 1.4 shows that ^i-analysis is accurate when the coefficients of D* f are sparse or decay 
rapidly. As discussed above, this occurs in many important applications. However, if it is not the 
case, then the theorem does not guarantee good recovery. As previously mentioned, this may occur 
when the dictionary D is a concatenation of two (even orthonormal) bases. For example, a signal / 
may be decomposed as / = /i + /2 where /i is sparse in the basis Di and /2 is sparse in a different 
basis, D2. One can consider the case where these bases are the coordinate and Fourier bases, or 
the curvelet and wavelet bases, for example. In these cases, D* f is likely to decay slowly since 
the component that is sparse in one basis is not at all sparse in the other [19]. This suggests that 
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Figure 5: Portion of the matrix D*D, in log-scale (left), 
absolute value) of the signal from Figure l2| (right). 



Sorted analysis coefficients (in 



^i-analysis may then not be the right algorithm for reconstruction in such situations. 



4.1 Alternatives 

Even though £i-analysis may not work well in this type of setup, one should still be able to take 
advantage of the sparsity in the problem. We therefore suggest a modification of ^i-analysis which 
we call Split-analysis. As the name suggests, this problem splits up the signal into the components 
we expect to be sparse: 



(/l,/2 



argmm 

fuh 



I^i/i||i + p2/2||i subject to P(/i + /2)-y||2<e. 



The reconstructed signal would then be / = /i + /2 . Some applications of this problem in the area 
of image restoration have been studied in [8]. Since this is an analagous problem to ^i-analysis, 



one would hope to have a result for Split-analysis similar to Theorem 1.4 

An alternative way to exploit the sparsity in / = /i + /2 is to observe that there may still exist a 
(nearly) sparse expansion / = Dx = Dixi + D2X2- Thus one may ask that if the coefficient vector 
X is assumed sparse, why not just minimize in this domain? This reasoning leads to an additional 
approach, called £i-Synthesis or Basis Pursuit (see also the discussion in [21j): 



argmin ||x||i subject to ||ylDx — y||2 < e. 



(^i-synthesis) 



The reconstructed signal is then / = Dx. Empirical studies also show that £i-synthesis often 
provides good recovery, however, it is fundamentally distinct from £i-analysis. The geometry of the 
two problems is analyzed in |21J, and there it is shown that because these geometrical structures 
exhibit substantially different properties, there is a large gap between the two formulations. This 
theoretical gap is also demonstrated by numerical simulations in [21], which show that the two 
methods perform very differently on large families of signals. 



16 




Figure 6: Relative recovery error of ^i-analysis as a function of the (normalized) noise level, 
averaged over 5 trials. The solid line denotes standard ^i-analysis, and the dashed line denotes 
^i-analysis with 3 reweighted iterations. The a;-axis is the relative noise level -Jrna/\\Af\\2 
while the y-axis is the relative error ||/ — /||2/||/||2- 



4.2 Fast Transforms 

For practical reasons, it is clearly advantageous to be able to use measurement matrices A which 
allow for easy storage and fast multiplication. The partial DFT for example, exploits the Fast 
Fourier Transform (FFT) which allows the sampling matrix to be applied to a n-dimensional vector 
in O(nlogn) time, and requires only 0(7nlogn) storage. Since the partial DFT has been proven 
to satisfy the RIP [16] (see also [37]), it is a fast measurement matrix that can be used in many 
standard compressed sensing techniques. 

One of course hopes that fast measurement matrices can be used in the case of redundant 
and coherent dictionaries as well. As mentioned, the result of Krahmer and Ward implies that any 
matrix which satisfies the RIP will satisfy the D-RIP when multiplied by a random sign matrix [29] . 
Therefore, the m x n subsampled Fourier matrix with m = 0(slog n) along with the sign matrix 
will satisfy the D-RIP and provides the fast multiply. 

A result at the origin of this notion was proved by Ailon and Liberty, also after our initial 



submission of this paper ^. Recall that in light of (1.5), we desire a fast transform that satisfies 
the Johnson-Lindenstrauss lemma in the following sense. For a set Q of N vectors in n-dimensional 
space, we would like a fast transform A that maps this set into a space of dimension O(logA^) 
(possibly with other factors logarithmic in n) such that 



(l-<5)||w||2 < \\Av\\i < (1 + 5)11^112 for ahweQ. 



(4.1) 



Note that the dimension 0(log A^) will of course also depend on the constant 6. Due to standard 
covering arguments (see e.g. |38l [6]). this would yield amxn fast transform with optimal number 
of measurements, m = O(slogn), obeying the D-RIP. 

Ailon and Liberty show that the subsampled Fourier matrix multiplied by a random sign matrix 
does exactly this |[4j. Thus in other words, for a fixed m, this mxn construction satisfies the D-RIP 
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Figure 7: The largest 200 coefficients of the coefficient vector D* f (blue), D* f from ii- 
analysis (dashed green), D* f from ^i-analysis with 3 reweighting iterations (dashed red), x 
from ^i-synthesis (cyan), and D* f from ^i-synthesis (magenta). 

up to sparsity level s = 0{m/ log n). The cost of a matrix-vector multiply is of course dominated 
by that of the FFT, O(nlogn). Its storage requirements are also 0(?TT,logn). Their results can also 
be generalized to other transforms with the same type of fast multiply. 

These results yield a transform with a fast multiply which satisfies the D-RIP. The number of 
measurements and the multiply and storage costs of the matrix are of the same magnitude as those 
that satisfy the RIP. The D-RIP is, therefore, satisfied by matrices with the same benefits as those 
in standard compressed sensing. This shows that compressed sensing with redundant and coherent 
dictionaries is viable with completely the same advantages as in the standard setting. 
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